reliability than does the average response latency, which in turn has
a higher reliability than do the test scores; (3) a few of the 
findings are ambivalent since some results suggest that equivalence 
estimates for computer-based and paper-based measures (i.e., test 
score and average degree of confidence) are about the same, and 
another suggests that these estimates are different; and (4) the
FOREWORD

This research was performed under Exploratory Development work unit RF63- 
522-801-013-03.04, Testing Strategies for Operational Computer-based Training, under
the sponsorship of the Office of Naval Technology, and Advanced Development pro- 
ject Z1772-ET008, Computer-Based Performance Testing, under the sponsorship of 
Deputy Chief of Naval Operations (Manpower, Personnel, and Training). The general 
goal of this development is to create and evaluate computer-based representations of 
operationally oriented tasks to determine if they result in better assessment of student 
performance than more customary measurement methods. 

The results of this study are primarily intended for the Department of Defense 
training and testing research and development community.



B. E. BACON 
Captain, U.S. Navy 
Commanding Officer
Technical Director



SUMMARY 



Problems 

Many student assessment schemes currently used in Navy training are suspected 
of being insufficiently accurate or consistent. If true, this could result in either
overtraining, which increases costs needlessly, or undertraining, which culminates in
unqualified graduates being sent to the fleets. 

Objective 

The specific objective of this research was to compare the reliability and validity 
of a computer-based and a paper-based procedure for assessing semantic knowledge. 

Method 

A Soviet threat-parameter database was compiled with the assistance of intelli- 
gence officers and instructors at VF-124, Naval Air Station (NAS) Miramar. This was 
structured as a semantic network in order to represent the associative knowledge 
inherent to it for the computer system. That is, objects and their corresponding proper- 
ties, attributes, or characteristics were represented as node-link structures. The links 
between the nodes represent the associations or relationships among objects or among 
objects and their attributes. 

A computer-based and paper-based test were designed and developed to assess 
this threat-parameter knowledge. Using a within-subjects experimental design, these 
tests were administered to 75 F-14 and E-2C crew members who volunteered to
participate in this study. After subjects received one test, they were immediately given the
other. It was assumed that a subject's state of threat-parameter knowledge was the 
same during the administration of both tests. 

Reliabilities for both modes of testing were estimated by deriving internal con- 
sistency indices using an odd-even item split. These estimates were adjusted by 
employing the Spearman-Brown Prophecy Formula. Reliability estimates were calcu- 
lated for test score, average degree of confidence, and average response latency for the 
computer-based test; reliability estimates were calculated for test score and average 
degree of confidence only for the paper-based test. None was computed for average 
response latency since this was not measured for the paper-based test. Equivalences 
between these two modes of assessment were estimated by Pearson product-moment 
correlations for total test score and average degree of confidence.

In order to derive discriminant validity estimates, research subjects were placed 
into groups according to three distinct grouping strategies: (a) above or below F-14 or 
E-2C mean flight hours, (b) F-14 radar intercept officers (RIOs) or pilots and E-2C 
naval flight officers (NFOs) or pilots, and (c) VF-124 students and instructors or 
members of other operational squadrons. Three stepwise multiple discriminant ana- 
lyses, using Wilks' criterion for including and rejecting variables, and their associated 
statistics were computed to ascertain how well computer-based and paper-based
measures distinguished among the defined groups expected to differ in the extent of
their knowledge of the threat-parameter database. 



Results 

This study established that (a) computer-based and paper-based measures, i.e., test 
score and average degree of confidence, are not significantly different in reliability or 
internal consistency; (b) for computer-based and paper-based measures, average degree 
of confidence has a higher reliability than average response latency which in turn has a 
higher reliability than the test score; (c) a few of the findings are ambivalent since 
some results suggest equivalence estimates for computer-based and paper-based meas- 
ures, i.e., test score and average degree of confidence, are about the same, and another 
suggests these estimates are different; and (d) the discriminant validity of the 
computer-based measures was superior to paper-based measures. 



Discussion and Conclusions 

In this study, computer-based and paper-based testing were not significantly 
different in reliability with the former having more discriminant validity than the latter. 
These results suggest that computer-based assessment may have more utility for 
measuring semantic knowledge than paper-based measurement. This implies that the 
type of computerized testing used in this research may be better for estimating
threat-parameter knowledge than traditional testing which has been primarily paper-based in
nature. 

The literature regarding computer-based assessment is contradictory and incon- 
clusive: Many benefits may be obtained from computerized testing. Some of these may 
be related to attitudes and assumptions associated with the use of novel media or inno- 
vative technology per se. However, and just as readily, potential problems may result 
from the employment of computer-based measurement. Differences between this mode 
of assessment and traditional testing techniques may, or may not, impact upon the 
reliability and validity of measurement. 



Recommendations 

1. It is recommended that the computer-based test, FlashCards, be used to not 
only quiz but also train the threat-parameter database to F-14 and E-2C crew
members. Currently, FlashCards and Jeopardy (the Computhreat system) are being 
used by VF-124 to augment the teaching and testing of threat parameters. 

2. Other computer-based quizzes being developed at NPRDC should be used in 
different content areas to provide evidence about the generalizability of the reliability
and validity findings established in this research.
INTRODUCTION



Problems 

Many student assessment schemes currently used in Navy training are suspected 
of being insufficiently accurate or consistent. If true, this could result in either over- 
training, which increases costs needlessly, or undertraining, which culminates in 
unqualified graduates being sent to the fleet commands. Many customary methods for 
measuring performance either on the job or in the classroom involve instruments which 
are primarily paper-based in nature (e.g., check lists, rating scales, critical incidences, 
and multiple-choice, completion, true-false, and matching formats). A number of 
deficiencies exist with these traditional testing techniques; e.g., (a) biased items are 
generated by different individuals, (b) item-writing procedures are usually obscure, (c) 
there is a lack of objective standards for producing tests, (d) item content is not typi- 
cally sampled in a systematic manner, and (e) there is often a poor relationship 
between what is taught and test content. 

What is required is a theoretically and empirically grounded technology of pro- 
ducing procedures for testing which will correct these faults. One promising approach 
employs computer technology. However, very few data are available regarding the 
psychometric properties of testing strategies using this technology. Data are needed 
concerning the accuracy, consistency, sensitivity, and fidelity of these computer-based 
assessment schemes compared to more traditional testing techniques. 

Objective

The specific objective of this research was to compare the reliability and validity 
of a computer-based and a paper-based procedure for assessing semantic knowledge. 

METHOD 



Subjects 

The subjects were 75 F-14 pilots, radar intercept officers (RIOs), and students as 
well as E-2C pilots and naval flight officers (NFOs) from operational squadrons at 
Naval Air Station (NAS) Miramar who had volunteered to participate in this research. 
The primary test-bed has been the Fleet Replacement Squadron, VF-124, NAS 
Miramar. The main reason this squadron exists is to train pilots and RIOs for the F-14 
fighter. One of the major missions of the F-14 is to protect carrier-based naval task
forces against antiship, missile-launching, threat bombers. This part of the F-14's
mission is referred to as Maritime Air Superiority (MAS), which is taught in the
Advanced Fighter Air Superiority (ADFAS) curriculum in the squadron. It is during 
ADFAS that the students must learn a threat-parameter database so that they can prop- 
erly employ the F-14 against hostile platforms. E-2C pilots, NFOs, and students 
receive similar instruction. The tests currently administered to these officers are pri- 
marily paper-based in nature and normally formatted as multiple choice and
completion items.



Subject Matter 

A classified database was developed consisting of five categories of facts about 
front-line Soviet platforms: weapons systems, radar and ECM systems, surface and 
subsurface platforms, airborne platforms, and counterjamming procedures. It was used 
to train and test F-14 pilots, RIOs, and students concerning important threat parameters 
associated with Russian platforms: e.g., aircraft range and speed, payload of antiship 
missiles, typical launch altitude; missile range, flight profile, velocity, and warheads; 
other weapon, radar, electronic countermeasure (ECM)/ electronic counter- 
countermeasure (ECCM) systems; and surveillance capabilities. 

The database was compiled with the assistance of the intelligence officers and the 
ADFAS instructors of VF-124. It was structured as a semantic network (Barr & 
Feigenbaum, 1981; Johnson-Laird, 1983) in order to represent the associative
knowledge inherent to it for the computer system. That is, objects and their 
corresponding properties, attributes, or characteristics were represented as node-link 
structures. The links between those nodes represent the associations or relationships 
among objects or among objects and their attributes. For example, the object "aircraft 
type" and the attribute "ECM suite" can be linked so that the system can represent a 
particular aircraft type that has a certain ECM suite. By defining initially all objects 
and attributes in the database, a hierarchy or tree structure can be specified for all 
objects, attributes, and their relationships. A typical database can contain representa- 
tions of several thousands of such associations. The database can also include 
synonyms and quantifiers. The former allows an object to be specified or referred to 
in several ways; the latter allows the number of certain attributes to be associated with 
a particular object. 
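
The node-link representation described above can be illustrated with a minimal sketch. This is not the report's implementation (which was written in UCSD Pascal); the class name, method names, and the placeholder platform and missile names are all hypothetical, and treating a quantifier as a simple count of linked attributes is an assumption made for illustration.

```python
from collections import defaultdict


class SemanticNetwork:
    """Toy node-link store: (object, relation) -> set of linked attribute nodes."""

    def __init__(self):
        self.links = defaultdict(set)   # (object, relation) -> linked nodes
        self.synonyms = {}              # lower-cased alias -> canonical name

    def add_synonym(self, alias, canonical):
        # A synonym lets the same object be referred to in several ways.
        self.synonyms[alias.lower()] = canonical

    def canonical(self, name):
        # Resolve an alias to its canonical node; unknown names pass through.
        return self.synonyms.get(name.lower(), name)

    def link(self, obj, relation, value):
        # Record an association between an object and one of its attributes.
        self.links[(self.canonical(obj), relation)].add(value)

    def query(self, obj, relation):
        return sorted(self.links.get((self.canonical(obj), relation), set()))

    def count(self, obj, relation):
        # Assumed reading of a quantifier: how many such attributes an object has.
        return len(self.links.get((self.canonical(obj), relation), set()))


# Placeholder data only; none of these names come from the actual database.
net = SemanticNetwork()
net.add_synonym("Bomber A", "Aircraft-A")
net.link("Aircraft-A", "ECM suite", "Suite-1")
net.link("Bomber A", "carries missile", "XYZ-123")
net.link("Aircraft-A", "carries missile", "XYZ-456")
print(net.query("bomber a", "carries missile"))
print(net.count("Aircraft-A", "carries missile"))
```

Keying the links on (object, relation) pairs keeps the quiz logic independent of the content: any program that can enumerate the pairs can ask about them.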



Computer-Based Assessment 

Once a database was structured as a semantic network, it became possible for 
independent software modules to interact with, operate upon, or manipulate the
database. For example, interpretative programs could make inferences about the subject
database, or they could ask questions about the database since its intrinsic structure 
was represented. This latter capability was capitalized upon in this research. 

A computer-based game was adopted and adapted to quiz students and instructors 
in VF-124 as well as crew members of other operational squadrons that belong to the 
wing at NAS Miramar about the threat-parameter database. This computer-based quiz, 
or test, is totally independent of the database and will run on any database structured 
as a semantic network. It will randomly select objects from the database, and generate 
questions about them and their attributes. Unlike some computer-based tests, alterna- 
tive forms did not have to be specifically programmed as such. 
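
The idea that alternative forms need not be programmed separately can be sketched as follows. This is an illustrative assumption about how random selection might work, not the FlashCards code; the network contents, function names, and question wording are hypothetical.

```python
import random

# Hypothetical miniature network: (object, relation) -> correct answers.
NETWORK = {
    ("XYZ-123 missile", "is carried by"): ["Bomber-A", "Bomber-B"],
    ("Bomber-A", "has ECM suite"): ["Suite-1"],
    ("Bomber-B", "has ECM suite"): ["Suite-2"],
}


def generate_question(network, rng):
    """Pick one link at random and phrase it as a completion question."""
    (obj, relation), answers = rng.choice(sorted(network.items()))
    prompt = f"{obj} {relation}: name {len(answers)} answer(s)."
    return prompt, set(answers)


def generate_form(network, n_items, seed):
    """Build a test form; a different seed yields an alternative form."""
    rng = random.Random(seed)
    return [generate_question(network, rng) for _ in range(n_items)]


form_a = generate_form(NETWORK, n_items=3, seed=1)
for prompt, answers in form_a:
    print(prompt)
```

Because the generator reads only the structure of the network, the same code would run unchanged against any database organized the same way.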

With the database represented as a semantic network, it was feasible to employ 
one of the games or quizzes that was programmed as a component of the Computer- 
Based Tactical Memorization Training System developed by the Navy Personnel 
Research and Development Center (NPRDC) under the work unit entitled Computer-
Based Techniques for Training Tactical Knowledge, RF63-522-801-013.03.02. To
reiterate, the games are autonomous entities which can operate on any database that 
can be structured as a semantic network. These games can quiz students by randomly 
choosing characteristics or objects from the database, and generating questions about 
threat platforms and their salient attributes. 

One of the computer-based games that was chosen from this prior NPRDC 
development for conducting this research is called FlashCards. It was substantially
improved to yield: more experimental control, measures of response latencies and 
degrees of confidence in responses, and better record keeping for assessing student per- 
formance, facilitating the computation of statistical analyses, and presenting feedback 
to the instructors and students. These programming enhancements were documented by 
Liggett and Federico (1986). The computer-based system containing FlashCards and 
another game, Jeopardy, together with the threat-parameter database for the F-14 and
E-2C communities is referred to as Computhreat. 

FlashCards is analogous to using real flash cards. That is, a question is presented 
to individual students who are expected to answer it. Questions can have multiple
answers as in "What Soviet bombers carry the XYZ-123 missile?" After individual
students are presented with the question, they are allowed as many tries as they would
like to answer. If the students cannot answer the question, they can continue with the 
game. At this point, they are presented with the correct answer or answers. At any 
point in the answering process, they can continue to the next question. For each 
answer, the students must key in a response which reflects their degree of confidence 
in their answer. Also, for each answer, the student's response latency is recorded and 
displayed. 

FlashCards will quiz the students on all top-level, or general, categories of the 
semantic network that it is using as the database. After the game, students are given 
feedback as to their overall performance. FlashCards keeps records of a student's
latency, confidence, overall score, number answered correctly, number answered
incorrectly, and number not answered. Records are kept across all items for each stu- 
dent. 

A question cycle begins with an individual student being prompted with a ques- 
tion and the number of correct answers required to fully answer that question. Also 
visible is an empty Correct Answers Menu which is a box structure that will hold all 
the correct answers. An answer will be placed there when an individual answers a 
question correctly, or gives up in which case the program divulges the correct 
answer(s). The testee is notified that a clock has started, and is then required to type in 
an answer. After typing <return> at the end of the answer, the individual is given
response time in seconds, and presented with a scale ranging from zero to one-hundred 
percent in ten point intervals to be used to indicate the percentage of confidence or the 
degree of sureness the testee has in the answer(s). The student is then required to type 
in a single digit corresponding to the selected confidence level. After the confidence 
value is entered, the testee is notified if the answer was correct or incorrect. If correct, 
the answer is put into the Correct Answers Menu and the number of answers left to be 
entered is decremented. If that number is zero, the question terminates and program 
control is passed to the next question. If the answer is incorrect, the individual is 
merely prompted again to enter an answer. If the testee does not know all the correct
answers, a <return> may be entered to put all the remaining correct answers in the
Correct Answers Menu.
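
The question cycle just described can be summarized in a short, non-interactive sketch. It is a simplification under stated assumptions, not the original program: responses are supplied as a list rather than typed, an empty answer string stands for a bare <return>, a repeated correct answer is treated as incorrect, and the function and field names are hypothetical.

```python
def run_question(correct_answers, responses):
    """Simulate one question cycle.

    correct_answers: set of acceptable answers for the question.
    responses: sequence of (answer, confidence_percent, latency_seconds);
    an empty answer string stands for a bare <return> (the testee gives up).
    """
    remaining = set(correct_answers)   # answers still to be entered
    menu = []                          # contents of the Correct Answers Menu
    log = []                           # (answer, confidence, latency, was_correct)
    for answer, confidence, latency in responses:
        if answer == "":
            break                      # give up: remaining answers are divulged
        was_correct = answer in remaining
        log.append((answer, confidence, latency, was_correct))
        if was_correct:
            remaining.discard(answer)
            menu.append(answer)
            if not remaining:
                break                  # all answers found: go to next question
    revealed = sorted(remaining)
    menu.extend(revealed)
    return {
        "menu": menu,
        "revealed": revealed,
        "entered": len(log),
        "correct": sum(1 for entry in log if entry[3]),
    }


# Placeholder answers only.
result = run_question(
    {"Bomber-A", "Bomber-B"},
    [("Bomber-A", 90, 4.2), ("Bomber-C", 40, 6.0), ("", 0, 0.0)],
)
print(result)
```

The "entered" and "correct" counts are the two quantities a per-question score would need.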

The score for each question was computed as the number of correct answers 
entered divided by the total number of answers entered. A <return> was not counted
as an answer. For the purposes of this research, a complete FlashCards test consisted 
of 25 domain-referenced items or questions. These were considered as two groups of 
12 odd and even items each, dropping the last question, for computing split-half
reliability estimates. The average score for odd (even) items was calculated as the total
score of odd (even) items divided by the number of odd (even) questions attempted. 
The total computer-based test score was calculated as the average of the odd and even 
halves. 
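
The scoring rules above can be written out as a brief sketch. The function names are hypothetical, and two details are assumptions the text does not settle: a question with no answers entered is scored 0.0, and an unattempted question is marked with None and left out of its half.

```python
def question_score(n_correct, n_entered):
    """Correct answers entered divided by all answers entered (0.0 if none)."""
    return n_correct / n_entered if n_entered else 0.0


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def half_scores(item_scores):
    """Mean score of the odd- and even-numbered items among the first 24.

    item_scores holds 25 per-question scores in presentation order;
    None marks a question that was not attempted (assumed convention).
    """
    used = item_scores[:24]                            # drop the 25th question
    odd = [s for s in used[0::2] if s is not None]     # items 1, 3, ..., 23
    even = [s for s in used[1::2] if s is not None]    # items 2, 4, ..., 24
    return _mean(odd), _mean(even)


def total_score(item_scores):
    """Total test score as the average of the odd and even halves."""
    odd, even = half_scores(item_scores)
    return (odd + even) / 2


# Made-up scores: every odd item fully correct, every even item half correct.
scores = [1.0, 0.5] * 12 + [0.0]
print(half_scores(scores), total_score(scores))
```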

The software for the complete gaming system is currently on eight floppy disks. 
The game itself is run with only two dual-density disks on a Terak microcomputer
employing two drives. It is implemented on the UCSD P-system and written in 
UCSD-Pascal. The disk placed in the bottom drive holds the actual game code; the 
disk placed in the top drive contains the independent semantic network database. As 
soon as the system is booted, control is passed to the game. Consequently, naive users 
need not deal with the nuances of the UCSD P-system. Knowledge-performance data
for the FlashCards game are saved for individual players on the disk in the lower 
drive. There are six other disks that contain files necessary for modification of the 
gaming system and/or data collection. These disks contain the text of the games, the 
semantic network database, the statistical programs, and all necessary P-system files. 

Paper-Based Assessment 

Two alternative forms of a paper-based test were designed and developed to
assess knowledge of the same threat-parameter database mentioned above, and to
mimic as much as possible the format used by FlashCards. Both of these consisted of 25
completion or fill-in-the-blank domain-referenced items. As with the computer-based 
test, more than one answer may be required per item or question. Beneath each
question was a confidence scale which resembled the one used in FlashCards where the
testees were required to indicate the level of confidence in their response(s). Scoring 
items for this paper-based test was similar to scoring the computer-based test: For each 
question, the number of correct answers given was divided by the total number of
answers completed for that question. Also, scoring odd (even) halves of the test for 
computing internal consistency was similar to that for FlashCards. The score for the
total paper-based test was calculated like the total score for the computer-based test.

Procedure 

Subjects acquired threat-parameter knowledge using dual media: (1) a traditional 
text organized according to the database's major topics, and (2) the Computhreat 
computer-based system. Mode of assessment, computer-based or paper-based, was 
manipulated as a within-subjects variable. Subjects were administered the computer- 
based and paper-based tests in counterbalanced order. The two forms of the paper- 
based tests were alternated in their administration to subjects, i.e., the first subject
received Form A, the second subject received Form B, the third subject received Form
A, etc. After subjects received one test, they were immediately administered the other. 
It was assumed that a subject's state of threat-parameter knowledge was the same dur- 
ing the administration of both tests. Subjects took approximately 10-15 minutes to 
complete the paper-based test, and 20-25 minutes to complete the computer-based test. 
The longer time to complete the latter test was largely attributed to lack of typing or 
keyboard proficiency on the part of some of the subjects. 

Reliabilities for both modes of testing were estimated by deriving internal con- 
sistency indices using an odd-even item split. These reliability estimates were adjusted 
by employing the Spearman-Brown Prophecy Formula (Thorndike, 1982). Reliability
estimates were calculated for test score, average degree of confidence, and average
response latency for the computer-based test; reliability estimates were calculated
only for test score and average degree of confidence for the paper-based test. None was
computed for average response latency since this was not measured for the paper-based 
test. Equivalences between the two modes of assessment were estimated by Pearson 
product-moment correlations for total test score and average degree of confidence. 
These correlations were considered indices of the extent to which the two types of
testing were measuring the same semantic knowledge and amount of assurance in
answers. 
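
The split-half procedure with the Spearman-Brown adjustment can be sketched as below. The function names and the sample half-test scores are hypothetical; only the technique (correlating odd and even halves, then stepping the correlation up to full test length with r' = 2r / (1 + r)) follows the description above.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(odd_scores, even_scores):
    """Adjusted split-half reliability from per-subject odd and even half scores."""
    return spearman_brown(pearson_r(odd_scores, even_scores))


# Made-up half-test scores for four subjects, for illustration only.
odd = [0.2, 0.4, 0.6, 0.8]
even = [0.3, 0.3, 0.7, 0.9]
print(round(split_half_reliability(odd, even), 2))
```

The same `pearson_r` applied to the computer-based and paper-based totals of each subject would give the equivalence estimate described in the paragraph above.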

In order to derive discriminant validity estimates, research subjects were placed 
into groups according to three distinct grouping strategies: (a) above or below F-14 or
E-2C mean flight hours, (b) F-14 RIOs or pilots and E-2C NFOs or pilots, and (c) 
VF-124 students and instructors or members of other operational squadrons. Three 
stepwise multiple discriminant analyses, using Wilks' criterion for including and reject- 
ing variables, and their associated statistics were computed to ascertain how well 
computer-based and paper-based measures distinguished among the defined groups 
expected to differ in the extent of their knowledge of the threat-parameter database. It 
was thought that mean flight hours reflect operational experience. Those individuals 
with more operational experience were expected to perform better on tests of threat- 
parameter knowledge than those with less experience. It was thought that F-14 crew 
members would have knowledge superior to E-2C crew members regarding threat 
parameters because of the difference in their operational missions and training 
emphasis. Lastly, it was expected that students would do better on tests of threat-
parameter knowledge because their exposure to this subject matter was more recent than
that of instructors and members of other operational crews who probably had not
reviewed this material for some time.



RESULTS 



Reliability and Equivalence Estimates 

Tables of reliability and validity estimates are presented in the appendix. Split- 
half reliability and equivalence estimates of computer-based and paper-based measures 
from the pooled within-groups correlation matrices for the different groupings are
tabulated in Table 1. It can be seen that the adjusted reliability estimates of the
computer-based and paper-based measures are from moderate to high for the different groupings
ranging from: (a) .73 to .97 for F-14 RIO and pilot and E-2C NFO and pilot, (b) .74 to 
.97 for above and below mean flight hours, and (c) .53 to .95 for student, instructor, 
and other. None of the differences in corresponding reliabilities for computer-based 
and paper-based measures, i.e., test score and average degree of confidence, were 
found to be statistically significant (p > .01) using a test described by Edwards (1964). 
This suggested that the computer-based and paper-based measures were not 
significantly different in reliability or internal consistency.

Considering the computer-based measures for all groupings, it was ascertained 
that the reliability estimate for average degree of confidence was significantly (p < .01) 
higher than the reliability estimates for average response latency and test score. Also, 
the reliability estimate for response latency was significantly higher than the one com- 
puted for test score. Focusing on the paper-based measures for all groupings, it was 
found that the reliability estimate for average degree of confidence was significantly (p 
< .01) higher than the reliability estimate for test score. These results implied that 
these measures can be ranked in order of their internal consistencies from highest to 
lowest as follows: average degree of confidence, average response latency, and test 
score. 

Equivalence estimates for the different groupings reported in the same order as 
above for test score and average degree of confidence measures, respectively, were 
.76 and .82, .76 and .82, and .50 and .76. These suggested that the computer-based 
and paper-based measures had anywhere from 25% to 67% variance in common 
implying that these different modes of assessment were somewhat or partially 
equivalent. Equivalence is somewhat limited by the low reliability obtained for the
computer-based measure of test score for the grouping: students, instructors, or others. 
For the F-14/E-2C and mean flight hours groupings, the equivalences for test score and 
average degree of confidence measures were not significantly (p > .01) different.
However, for the student/instructor grouping, the equivalences of these measures were
found to be significantly (p < .01) different. These results are ambiguous in that some 
of them suggest that the equivalence estimates for test score and average degree of 
confidence measures are about the same; while, the other suggests that these estimates 
are different. 
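
The "variance in common" figures quoted above follow from squaring the equivalence correlations. As a worked check using only the coefficients reported in this section:

```latex
\begin{align*}
r = .50 &\;\Rightarrow\; r^2 = .25   && \text{(about 25\% shared variance)} \\
r = .76 &\;\Rightarrow\; r^2 \approx .58 && \text{(about 58\% shared variance)} \\
r = .82 &\;\Rightarrow\; r^2 \approx .67 && \text{(about 67\% shared variance)}
\end{align*}
```

The lowest and highest values give the 25% to 67% range stated in the text.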



Discriminant Validity Estimates 

Above or Below F-14 or E-2C Mean Flight Hours

The discriminant analysis computed to determine how well computer-based and 
paper-based measures differentiated groups defined by above or below F-14 or E-2C 
mean flight hours yielded one significant discriminant function. According to the multi- 
ple discriminant analysis model (Cooley & Lohnes, 1962; Tatsuoka, 1971; Van de 
Geer, 1971), the maximum number of derived discriminant functions is either one less 
than the number of groups or equal to the number of discriminating variables, which- 
ever is smaller. Since there were four groups to be discriminated, this analysis yielded 
three discriminant functions, but only one of them was significant. Consequently, 
solely this significant discriminant function and its associated statistics are presented. 
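
The rule for the maximum number of discriminant functions can be written out as a worked example. The symbols g (number of groups) and p (number of discriminating variables) are notation introduced here for illustration; p = 5 assumes all five measures (three computer-based, two paper-based) were available to the stepwise procedure.

```latex
\[
  \text{maximum number of functions} = \min(g - 1,\; p)
\]
\[
  g = 4,\; p = 5:\quad \min(4 - 1,\; 5) = 3
  \qquad\qquad
  g = 3,\; p = 5:\quad \min(3 - 1,\; 5) = 2
\]
```

The first case corresponds to the four-group analyses; the second would apply to a three-group analysis such as the student, instructor, and other grouping.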



The statistics associated with the significant function, standardized discriminant- 
function coefficients, pooled within-groups correlations between the function and 
computer-based and paper-based measures, and group centroids for above or below F- 
14 or E-2C mean flight hours are presented in Table 2. It can be seen that the single 
significant discriminant function accounted for approximately 82% of the variance 
among the four groups. The discriminant-function coefficients which consider the
interactions among the multivariate measures revealed the relative contribution or com- 
parative importance of these variables in defining this derived dimension to be the 
paper-based test total score (PTS), the computer-based test total score (CTS), and the 
computer-based test total average degree of confidence (CTC), respectively. The 
computer-based test total average latency (CTL) and the paper-based test total average 
degree of confidence (PTC) were considered unimportant in specifying this
discriminant function since the absolute values of their coefficients were each below .4. The
within-groups correlations which are computed for each individual measure partialling
out the interactive effects of all the other variables indicated that the major contributors 
to the significant discriminant function were CTC, CTS, and CTL, respectively, all 
computer-based measures. The group centroids showed how the performance of the 
F-14 crew members clustered together along one end of the derived dimension; while, 
the performance of the E-2C crew members clustered together along the other end of 
the continuum. The means and standard deviations for groups above or below F-14 or 
E-2C mean flight hours, univariate F-ratios, and levels of significance for computer- 
based and paper-based measures are tabulated in Table 3. Considering the measures as 
univariate variables, i.e., independent of their multivariate relationships with one 
another, these statistics revealed that the three computer-based measures CTC, CTS, 
and CTL, respectively, significantly differentiated the four groups, not the paper-based
measures, PTS and PTC. Applying Duncan's multiple range test (Kirk, 1968) on the
group means for the important individual measures indicated that F-14 crews 
significantly (p < .05) outperformed E-2C crews on CTS, CTC, and CTL. The
multivariate and subsequent univariate results established the discriminant validity of
computer-based measures to be superior to that of paper-based measures for the group- 
ing strategy: above or below F-14 or E-2C flight hours. 

F-14 RIOs or Pilots and E-2C NFOs or Pilots 

The statistics associated with the significant function, standardized discriminant 
function coefficients, pooled within-groups correlations between the function and 
computer-based and paper-based measures, and group centroids for F-14 RIOs or pilots 
and E-2C NFOs or pilots are presented in Table 4. A single significant discriminant 
function accounted for approximately 82% of the variance among the four groups. The 
discriminant-function coefficients revealed the relative contribution of the multivariate 
measures in defining this derived dimension to be PTS, CTS, CTL, and PTC, respec- 
tively. CTC was considered unimportant in specifying this discriminant function since 
the absolute value of its coefficient was below .4. The within-groups correlations for 
the measures indicated that the major contributors to the significant discriminant
function were CTC, CTS, CTL, and PTC, respectively. Seventy-five percent of these
were computer-based measures. The group centroids showed how the performance of 
the F-14 crew members clustered together along one end of the derived dimension; 
while, the performance of the E-2C crew members was spread out along the other end
of the continuum. The means and standard deviations for groups of F-14 RIOs or
pilots and E-2C NFOs or pilots, univariate F-ratios, and levels of significance for 
computer-based and paper-based measures are tabulated in Table 5. Considering the 
measures as univariate variables, these statistics revealed that the three computer-based 
measures CTL, CTS, CTC, and one paper-based measure, PTC, respectively, 
significantly differentiated the four groups. Applying Duncan's multiple range test on
the group means for these individual measures indicated that (a) F-14 crews
significantly (p < .05) outperformed E-2C crews on CTS and CTC; and (b) F-14 crew
members and E-2C NFOs significantly outperformed E-2C pilots on CTL and PTC
measures. The multivariate and univariate results established the discriminant validity 
of the computer-based measures to be greater than the paper-based measures for the 
grouping strategy: F-14 RIOs or pilots and E-2C NFOs or pilots. 

VF-124 Students and Instructors or Members of Other Operational Squadrons

The statistics associated with the significant function, standardized discriminant- 
function coefficients, pooled within-groups correlations between the function and 
computer-based and paper-based measures, and group centroids for VF-124 students 
and instructors or members of other operational squadrons are presented in Table 6. A 
single significant discriminant function accounted for approximately 98% of the vari- 
ance among the three groups. The discriminant-function coefficients revealed the rela- 
tive contribution of the multivariate measures in defining this derived dimension to be 
CTS and CTC, respectively. The within-groups correlations for the measures indicated 
that the major contributors to the significant discriminant function were CTS, CTC,
PTS, and PTC, respectively. Half of these were computer-based measures, and half 
were paper-based measures. The group centroids showed how the performances of the 
students, instructors, and others were spread out along the entire dimension. The 
means and standard deviations for groups of VF-124 students and instructors or 
members of other operational squadrons, univariate F-ratios, and levels of significance 
for computer-based and paper-based measures are tabulated in Table 7. Considering 
the measures as univariate variables, these statistics revealed that all three computer- 
based measures CTS, CTC, CTL, and the two paper-based measures, PTS and PTC, 
respectively, significantly differentiated the three groups. Applying Duncan's multiple 
range test on the group means for these individual measures indicated that (a) students
significantly (p < .05) outperformed instructors, who in turn did better than members
of other operational squadrons on CTS; (b) students and instructors did equally well
but significantly outperformed members of other operational squadrons on CTC, CTL,
and PTC; and (c) students did significantly better than instructors and others, who
performed equally well on PTS. The multivariate and univariate results established the
discriminant validity of the computer-based measures to be higher than paper-based 
measures for the grouping strategy: VF-124 students and instructors or members of 
other operational squadrons.

General Discriminant Validity

Distinguishing among the groups formed by the three grouping strategies sug- 
gested that, generally, the discriminant validity of the computer-based measures was 
superior to that of the paper-based measures. 

Discussion and Conclusions 

This study established that (a) computer-based and paper-based measures, i.e., test 
score and average degree of confidence, are not significantly different in reliability or 
internal consistency; (b) for computer-based and paper-based measures, average degree 
of confidence has a higher reliability than average response latency which in turn has a 
higher reliability than the test score; (c) a few of the findings are ambivalent since 
some results suggest that equivalence estimates for computer-based and paper-based
measures, i.e., test score and average degree of confidence, are about the same, and
another suggests that these estimates are different; and (d) the discriminant validity of
the computer-based measures was superior to that of the paper-based measures. The results of this
research supported the findings of some studies, but not others. The reported literature 
on this subject is contradictory and inconclusive. 

The consequences of computer-based assessment on examinees' performance are
not obvious. The few studies that have been conducted on this topic have produced
mixed results. Investigations of computer-based administration of personality items 
have yielded reliability and validity indices comparable to typical paper-based adminis- 
tration (Katz & Dalby, 1981; Lushene, O'Neil, & Dunn, 1974). No significant 
differences were found in the scores of measures of anxiety, depression, and psycho- 
logical reactance due to computer-based and paper-based administration (Lukin, Dowd, 
Plake, & Kraft, 1985). Studies of cognitive tests have provided inconsistent findings 
with some (Rock & Nolen, 1982; Hitti, Riffer, & Stuckless, 1971) demonstrating that 
the computerized version is a viable alternative to the paper-based version. Other 
research (Hansen & O'Neil, 1970; Hedl, O'Neil, & Hansen, 1973; Johnson & White, 
1980; Johnson & Johnson, 1981), though, indicated that interacting with a computer- 
based system to take an intelligence test could elicit a considerable amount of anxiety 
which could affect performance. 

Some studies (Serwer & Stolurow, 1970; Johnson & Mihal, 1973) demonstrated
that testees do better on verbal items administered by computer than on paper; however,
just the opposite was found by other studies (Johnson & Mihal, 1973; Wildgrube,
1982). One investigation (Sachar & Fletcher, 1978) yielded no significant differences
resulting from computer-based and paper-based modes of administration on verbal
items. Two studies (English, Reckase, & Patience, 1977; Hoffman & Lundberg, 1976)
demonstrated that these two testing modes did not affect performance on memory
retrieval items. Sometimes (Johnson & Mihal, 1973) testees performed better on
quantitative tests when these were given by computer; sometimes (Lee, Moreno, &
Sympson, 1984) they performed worse; and other times (Wildgrube, 1982) it made no
difference. Other studies have supported the equivalence of computer-based and
paper-and-pencil administration (Elwood & Griffin, 1972; Hedl, O'Neil, & Hansen, 1973;
Kantor, 1988; Lukin, Dowd, Plake, & Kraft, 1985). Some researchers (Evan & Miller,
1969; Koson, Kitchen, Kochen, & Stodolosky, 1970; Lucas, Mullin, Luna, & McInroy,
1977; Lukin, Dowd, Plake, & Kraft, 1985; Skinner & Allen, 1983) have reported
comparable or superior psychometric capabilities of computer-based assessment relative
to paper-based assessment in clinical settings.

Regarding computerized adaptive testing (CAT), some empirical comparisons 
(McBride, 1980; Sympson, Weiss, & Ree, 1982) yielded essentially no change in validity
due to mode of administration. However, test item difficulty may not be indifferent
to manner of presentation for CAT (Green, Bock, Humphreys, Linn, & Reckase, 
1984). When going from paper-based to computer-based administration, this mode 
effect is thought to have three aspects: (a) an overall mean shift, where all items may
be easier or harder; (b) an item-by-mode interaction, where a few items may be altered
and others not; and (c) a change in the nature of the task itself brought about by computer
administration. A computer simulation study (Divgi, 1988) demonstrated that a CAT
version of the Armed Services Vocational Aptitude Battery had higher reliability than 
a paper-based version for these subtests: General Science, Arithmetic Reasoning, Word 
Knowledge, Paragraph Comprehension, and Mathematics Knowledge. These incon- 
sistent results of mode, manner, or medium of testing may be due to differences in 
methodology, test content, population tested, or the design of the study (Lee, Moreno,
& Sympson, 1984).

With computer costs coming down and people's knowledge of these systems
going up, it becomes more likely, economically and technologically, that many benefits
can be gained from their use. Some indirect advantages of computer-based assessment
are increased test security, less ambiguity about students' responses, minimal or no
paperwork, immediate scoring, and automatic record keeping for item analysis (Green,
1983a, 1983b). Some of the strongest support for computer-based assessment is based
upon the awareness of faster and more economical measurement (Elwood & Griffin,
1972; Johnson & White, 1980; Space, 1981). Cory (1977) reported some advantages
of computerized over paper-based testing for predicting on-job performance.

Ward (1984) stated that computers can be employed to augment what is possible
with paper-based measurement, e.g., to obtain more precise information regarding a
student than is likely with more customary measurement methods, and to assess
additional aspects of performance. He discussed potential benefits that may be derived
from employing computer-based systems to administer traditional tests. Some of these
are as follows: (a) individualizing assessment, (b) increasing the flexibility and
efficiency for managing test information, (c) enhancing the economic value and
manipulation of measurement databases, and (d) improving diagnostic testing. Millman
(1984) claimed to agree with Ward, especially regarding the ideas that computer-based
measurement encourages individualizing assessment and designing software within the
context of cognitive science, and that what limits computer-based assessment is not
hardware inadequacy but incomplete comprehension of the processes intrinsic to testing
and knowing per se (Federico, 1980).

Sampson (1983) discussed some of the potential problems associated with
computer-based assessment: (a) not taking into account human factors principles to
design the human-computer interface, (b) individuals becoming so anxious when
interacting with a computer for assessment that the measurement obtained may be
questionable, (c) possibility of unauthorized access and invasion of privacy, (d)
inaccurate test interpretations by users of the system culminating in erroneously drawn
conclusions, (e) differences in modes of administration making paper-based norms
inappropriate for computer-based assessment, (f) lack of reporting reliability and
validity data for computerized tests, and (g) resistance toward using new computer-based
systems for performance assessment. A potential limitation of computer-based
assessment is depersonalization and decreased opportunity for observation. This is
especially true in clinical environments (Space, 1981). Most computer-based tests do not
allow individuals to omit or skip items, or to alter earlier responses. This restriction
could change the test-taking strategy of some examinees. To lift it, however, would
probably create confusion and hesitation during the process of retracing through items as
the testee uses clues from some to minimize the degree of difficulty of others (Green,
Bock, Humphreys, Linn, & Reckase, 1984).

Hofer and Green (1985) were concerned that computer-based assessment would
introduce irrelevant or extraneous factors that would likely degrade test performance.
These computer-correlated factors may alter the nature of the task to such a degree that
it would be difficult for a computer-based test and its paper-based counterpart to measure
the same construct or content. This could impact upon reliability, validity, normative
data, as well as other assessment attributes. They listed several factors which might
contribute to different performances on these distinct kinds of testing: (a) state anxiety
instigated when confronted by computer-based testing, (b) lack of computer familiarity
on the part of the testee, and (c) changes in response format required by the two
modes of assessment. These different dimensions could result in tests that are
nonequivalent; however, in this reported research, these diverse factors had no apparent
impact.

A number of known differences between computer-based and paper-based assess- 
ment which may affect equivalence and validity are as follows: No passive omitting of 
items is usually permitted on computer-based tests. An individual must respond, unlike
on most paper-based tests. Computerized tests typically do not permit backtracking. The
testee cannot easily review items, alter responses, or delay attempting to answer
questions. The capacity of the computer screen can have an impact on what usually are
long test items, e.g., paragraph comprehension. These may be shortened to accommo- 
date the computer display, thus partially changing the nature of the task. The quality of 
computer graphics may affect the comprehension and degree of difficulty of the item. 
Pressing a key or using a mouse is probably easier than marking an answer sheet. This 
may impact upon the validity of speeded tests. Since the computer typically displays 
items individually, traditional time limits are no longer necessary. The multidimen- 
sionality of achievement tests has implications for scoring CATs (Green, 1986). 

Some of the comments made by Colvin and Clark (1984) concerning instructional 
media can easily be extrapolated to assessment media. (Training and testing are inex- 
tricably intertwined; it is difficult to do one well without the other.) This is especially 
appropriate regarding some of the attitudes and assumptions permeating the employ- 
ment of, and enthusiasm for, media: (a) confronted with new media, computer-based or 
otherwise, students will not only work harder, but also enjoy their training and testing 
more; (b) matching training and testing content to mode of presentation is important, 
even though not all that prescriptive or empirically well established; (c) the application 
of computer-based systems permits self-instruction and self-assessment with their
concomitant flexibility in scheduling and pacing training and testing; (d) monetary and
human resources can be invested in designing and developing computer-based media
for instruction and assessment that can be used repeatedly and amortized over a longer 
time, rather than in labor intensive classroom-based training and testing; and (e) the 
stability and consistency of instruction and assessment can be improved by media, 
computer-based or not, for distribution at different times and locations however 
remote. 

When evaluating or comparing different media for instruction and assessment, one must
be aware that the newer medium may simply be perceived as being more novel,
interesting, engaging, and challenging by the students. This novelty effect seems to
disappear as rapidly as it appears. However, in research studies conducted over a
relatively short time span, e.g., a few days or months at the most, this effect may still be
lingering and affecting the evaluation by enhancing the impact of the more novel 
medium (Colvin & Clark, 1984). When matching media to distinct subject matters, 
course contents, or core concepts, some research evidence (Jamison, Suppes, & Welles, 
1974) indicates that, other than in obvious cases, just about any medium will be 
effective for different content. 

As is evident, the literature regarding computer-based assessment is contradictory
and inconclusive. Many benefits may be obtained from computerized testing. Some of
these may be related to attitudes and assumptions associated with the use of novel 
media or innovative technology per se. However, and just as readily, potential prob- 
lems may result from the employment of computer-based measurement. Differences 
between this mode of assessment and traditional testing techniques may, or may not, 
impact upon the reliability and validity of measurement. 

In this study, it was found that computer-based and paper-based testing were not 
significantly different in reliability with the former having more discriminant validity 
than the latter. These results suggest that computer-based assessment may have more 
utility for measuring semantic knowledge than paper-based measurement. This implies
that the type of computerized testing used in this research may be better for estimating
threat-parameter knowledge than traditional testing which has been primarily paper- 
based in nature. 

A salient question that needs to be addressed is how to combine effectively and 
efficiently computer and cognitive science, artificial intelligence (AI), current 
psychometric theory, and diagnostic testing. AI techniques can be developed to diag- 
nose specific error-response patterns or bugs to advance measurement methodology 
(Brown & Burton, 1978; Kieras, 1987; McArthur & Choppin, 1984). 

Recommendations

1. It is recommended that the computer-based test, FlashCards, be used not only
to quiz F-14 and E-2C crew members on the threat-parameter database but also to
train them on it. Currently, FlashCards and Jeopardy (the Computhreat system) are being
used by VF-124 to augment the teaching and testing of threat parameters.

2. Other computer-based quizzes being developed at NPRDC should be used in 
different content areas to provide evidence on the generalizability of the reliability and
validity findings established in this research.

TABLES OF RELIABILITY AND VALIDITY ESTIMATES

for Computer-Based and Paper-and-Pencil Measures

4. Statistics Associated with Significant Discriminant Function,
Standardized Discriminant-Function Coefficients, Pooled Within-Groups
Correlations Between the Discriminant Function and Computer-Based
and Paper-and-Pencil Measures, and Group Centroids for F-14 RIOs or
Pilots and E-2C NFOs or Pilots

5. Means and Standard Deviations for Groups of F-14 RIOs or Pilots and
E-2C NFOs or Pilots, Univariate F-Ratios, and Levels of Significance
for Computer-Based and Paper-and-Pencil Measures

6. Statistics Associated with Significant Discriminant Function,
Standardized Discriminant-Function Coefficients, Pooled Within-Groups
Correlations Between the Discriminant Function and Computer-Based
and Paper-and-Pencil Measures, and Group Centroids for VF-124 Students and
Instructors or Members of Other Operational Squadrons

7. Means and Standard Deviations for Groups of VF-124 Students and
Instructors or Members of Other Operational Squadrons, Univariate
F-Ratios, and Levels of Significance for Computer-Based and
Paper-and-Pencil Measures






Table 1

Split-Half Reliability and Equivalence Estimates of Computer-Based
and Paper-and-Pencil Measures from Pooled Within-Groups Correlation
Matrices for Different Groupings

Grouping: Above or Below Mean Flight Hours

                        Reliability
              -------------------------------
Measure       Computer-Based  Paper-and-Pencil   Equivalence

Score              .74              .76              .76
Confidence         .96              .97              .82
Latency

Grouping: F-14 RIOs/Pilots, E-2C NFOs/Pilots

                        Reliability
              -------------------------------
Measure       Computer-Based  Paper-and-Pencil   Equivalence

Score              .73              .77              .76
Confidence         .95              .97              .82
Latency            .86

Grouping: Students, Instructors, or Others

                        Reliability
              -------------------------------
Measure       Computer-Based  Paper-and-Pencil   Equivalence

Score              .53              .62              .50
Confidence         .94              .95              .76
Latency

Note. Split-half reliability estimates were adjusted by
employing the Spearman-Brown Prophecy Formula.
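
The adjustment named in the note can be sketched as follows. This is a minimal
illustration of the standard Spearman-Brown formula, not the study's own code; the
function names are hypothetical. Because Table 1 reports only the adjusted values, the
example works backward from the computer-based test score reliability of .74 (first
grouping) to the half-test correlation it implies, and then steps that value up again.

```python
def spearman_brown(r_half, factor=2.0):
    """Step up a part-test correlation to the reliability of a test
    `factor` times as long (factor=2 for split halves)."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


def implied_half_test_r(r_full):
    """Invert the split-half case: the correlation between halves
    implied by an already adjusted reliability."""
    return r_full / (2.0 - r_full)


# Computer-based test score, first grouping in Table 1: adjusted reliability .74
r_half = implied_half_test_r(0.74)
print(round(r_half, 2))                  # implied correlation between halves
print(round(spearman_brown(r_half), 2))  # stepping it up recovers the adjusted value
```

Under this formula, an adjusted reliability of .74 corresponds to a correlation of about
.59 between the two halves of the test.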



Table 2

Statistics Associated with Significant Discriminant Function, Standardized
Discriminant-Function Coefficients, Pooled Within-Groups Correlations Between
the Discriminant Function and Computer-Based and Paper-and-Pencil Measures,
and Group Centroids for Above or Below F-14 or E-2C Mean Flight Hours

Discriminant Function

Eigenvalue   Percent    Canonical     Wilks    Chi       d.f.   p
             Variance   Correlation   Lambda   Squared

   .44        82.43        .55         .64     31.38     15    .008

Measure   Discriminant   Within-Group
          Coefficient    Correlation

CTS           .91            .51
CTC           .84            .57
CTL          -.24           -.45
PTS         -1.19           -.00
PTC          -.17            .36

Group                     Centroid

Above F-14 Mean Hours        .10
Below F-14 Mean Hours        .39
Above E-2C Mean Hours      -1.35
Below E-2C Mean Hours      -1.50
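
The entries in the Discriminant Function block of Table 2 are linked by standard
identities. The sketch below assumes that the chi-squared value is Bartlett's
approximation computed from Wilks' lambda with N = 75 cases (the sum of the group sizes
in Table 3), five measures, and four groups. The report does not name the formula it
used, so these are assumptions, and the function names are illustrative.

```python
import math


def canonical_correlation(eigenvalue):
    """Canonical correlation implied by a discriminant-function eigenvalue."""
    return math.sqrt(eigenvalue / (1.0 + eigenvalue))


def bartlett_chi_squared(wilks_lambda, n_cases, n_measures, n_groups):
    """Bartlett's chi-squared approximation for Wilks' lambda, with its d.f."""
    multiplier = n_cases - 1 - (n_measures + n_groups) / 2.0
    chi_sq = -multiplier * math.log(wilks_lambda)
    df = n_measures * (n_groups - 1)
    return chi_sq, df


# Table 2 values: eigenvalue .44, Wilks' lambda .64
r_c = canonical_correlation(0.44)
chi_sq, df = bartlett_chi_squared(0.64, n_cases=75, n_measures=5, n_groups=4)
print(round(r_c, 2), round(chi_sq, 1), df)
```

Under these assumptions the canonical correlation matches the tabulated .55 and the
degrees of freedom match the tabulated 15. The chi-squared value comes out near 31,
close to the tabulated 31.38; a gap of that size is what one would expect if lambda was
rounded to two decimals before being tabulated.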





Table 3

Means and Standard Deviations for Groups Above or Below F-14
or E-2C Mean Flight Hours, Univariate F-Ratios, and Levels of
Significance for Computer-Based and Paper-and-Pencil Measures

                                  Group
             ----------------------------------------------------------
             Above F-14     Below F-14     Above E-2C     Below E-2C
             Flight Hours   Flight Hours   Flight Hours   Flight Hours
Measure      (n=26)         (n=37)         (n=5)          (n=7)            F       p

CTS   Mean     60.58          59.62          44.60          43.14        2.94    .039
      SD       15.75          18.77          15.68          17.37

CTC   Mean     75.58          80.84          48.60          64.57        4.11    .010
      SD       21.57          19.80          21.23          26.48

CTL   Mean      8.42           7.81           9.49          11.06        2.28    .087
      SD        3.31           2.77           4.10           3.94

PTS   Mean     51.65          49.73          45.80          52.86         .19    .900
      SD       18.26          20.38          11.86          13.91

PTC   Mean     72.23          76.70          53.00          69.71        2.14    .103
      SD       23.02          18.10          16.55          20.94






Table 4

Statistics Associated with Significant Discriminant Function, Standardized
Discriminant-Function Coefficients, Pooled Within-Groups Correlations Between
the Discriminant Function and Computer-Based and Paper-and-Pencil Measures,
and Group Centroids for F-14 RIOs or Pilots and E-2C NFOs or Pilots

Discriminant Function

Eigenvalue   Percent    Canonical     Wilks    Chi       d.f.   p
             Variance   Correlation   Lambda   Squared

   .66        81.96        .63         .53     44.72     15    .000

Measure   Discriminant   Within-Group
          Coefficient    Correlation

CTS          -.73           -.48
CTC          -.32           -.52
CTL           .57            .58
PTS         -1.15           -.05
PTC          -.45           -.45

Group           Centroid

F-14 RIOs         -.32
F-14 Pilots       -.21
E-2C NFOs          .58
E-2C Pilots       3.13






Table 5

Means and Standard Deviations for Groups of F-14 RIOs or Pilots
and E-2C NFOs or Pilots, Univariate F-Ratios, and Levels of
Significance for Computer-Based and Paper-and-Pencil Measures

                                  Group
             ----------------------------------------------------------
             F-14 RIOs      F-14 Pilots    E-2C NFOs      E-2C Pilots
Measure      (n=37)         (n=26)         (n=8)          (n=4)            F       p

CTS   Mean     60.57          59.23          48.88          33.50        3.74    .015
      SD       17.46          17.77           9.11          23.01

CTC   Mean   [illegible]    [illegible]    [illegible]    [illegible]    4.39    .007
      SD       20.67          20.66          18.80          31.08

CTL   Mean      8.18           7.88           8.40          14.43        5.84    .001
      SD        3.42           2.30           2.49           3.00

PTS   Mean     50.68          50.31          51.38          47.00         .05    .984
      SD       19.87          19.11          11.78          16.79

PTC   Mean     76.54          72.46          72.38          43.50        3.42    .022
      SD       21.72          18.11          11.44          21.63






Table 6

Statistics Associated with Significant Discriminant Function,
Standardized Discriminant-Function Coefficients, Pooled Within-Groups
Correlations Between the Discriminant Function and Computer-Based and
Paper-and-Pencil Measures, and Group Centroids for VF-124 Students
and Instructors or Members of Other Operational Squadrons

Discriminant Function

Eigenvalue    Percent       Canonical     Wilks    Chi           d.f.   p
              Variance      Correlation   Lambda   Squared

[illegible]   [illegible]   [illegible]    .40     [illegible]    10    [illegible]

Measure   Discriminant   Within-Group
          Coefficient    Correlation

CTS           .62            .86
CTC           .50            .70
CTL           .02           -.32
PTS           .24            .67
PTC          -.45           -.45

Group           Centroid

Students          1.34
Instructors        .05
Others           -1.20



Table 7

Means and Standard Deviations for Groups of VF-124 Students and
Instructors or Members of Other Operational Squadrons, Univariate F-Ratios,
and Levels of Significance for Computer-Based and Paper-and-Pencil Measures

                             Group
             -------------------------------------------
             Students       Instructors    Others
Measure      (n=30)         (n=11)         (n=34)           F        p

CTS   Mean     72.33          57.36          44.26        38.30    .000
      SD       13.30          16.30          11.03

CTC   Mean     91.10          78.91          60.29        25.06    .000
      SD       11.83          16.22          21.52

CTL   Mean      7.30           7.50           9.73         5.63    .005
      SD        2.80           2.50           3.41

PTS   Mean     63.97          48.27          39.18        23.09    .000
      SD       13.81          18.33          14.00

PTC   Mean     85.03          75.36          61.44        14.37    .000
      SD       16.99          14.61          18.99